Forest cover type and land cover play a key role in environmental assessment. Accurate information about natural resources matters to many different entities: private companies, local governments, federal agencies, and conservation organizations. Land cover data are normally generated from remote sensing, but those datasets can be difficult and costly to process. Therefore, we try to use cartographic data to predict forest cover types. Various supervised classification algorithms can be applied to this dataset, including K-Nearest Neighbors, Support Vector Machines, tree-based methods, and neural networks. In this project, we apply as many of these methods as we can and compare the results. The original effort to classify forest cover types on this dataset achieved 70.52% classification accuracy using an artificial neural network.
The main problem we are trying to solve is predicting forest cover type from cartographic information. Which model performs best on this classification task? Can the same model be applied to different regions? The results can feed into other analyses such as fire hazard prevention, natural asset management, and climate change studies.
Another problem I am trying to address is the trade-off of using remote sensing data. In many cases, remote sensing data is useful and beneficial: it can cover large areas, including places humans cannot reach in person, and its temporal dimension lets us observe dynamic environmental change. However, it also comes with disadvantages. The files are very large, so processing them requires substantial computing power, and remote sensing can be interfered with by other phenomena such as weather. The main goal here is to see whether we can predict tree type using cartographic information alone. If time permits, I will also try to combine remote sensing data with basic cartographic information: would the prediction perform better with data from different origins and dimensions?
This project is a supervised classification task. We will randomly divide the dataset into training and test sets, and perform cross-validation on the training set to find the best hyperparameters for each model. The main objective of this project is to find a method with high test accuracy for classifying forest cover types. The second objective is to achieve a high score in Kaggle's competition. The third objective is to use remote sensing data to increase the overall accuracy of the best model.
This dataset was retrieved from the UCI Machine Learning Repository. It originally came from Jock A. Blackard of the USFS and Dr. Denis J. Dean of UT Dallas. The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS sources. There are seven forest cover type classes: lodgepole pine, spruce/fir, ponderosa pine (Pinus ponderosa), Douglas-fir, aspen, cottonwood/willow, and krummholz.
Here is a description of all columns in this dataset.
- Elevation - Elevation in meters
- Aspect - Aspect in degrees azimuth
- Slope - Slope in degrees
- Horizontal_Distance_To_Hydrology - Horizontal distance to nearest surface water features
- Vertical_Distance_To_Hydrology - Vertical distance to nearest surface water features
- Horizontal_Distance_To_Roadways - Horizontal distance to nearest roadway
- Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
- Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
- Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
- Horizontal_Distance_To_Fire_Points - Horizontal distance to nearest wildfire ignition points
- Wilderness_Area (4 binary columns, 0 = absence, 1 = presence) - Wilderness area designation
- Soil_Type (40 binary columns, 0 = absence, 1 = presence) - Soil type designation
- Cover_Type (7 types, integers 1 to 7) - Forest cover type designation
The 7 different cover types are classified as: 1 - Spruce/Fir, 2 - Lodgepole Pine, 3 - Ponderosa Pine, 4 - Cottonwood/Willow, 5 - Aspen, 6 - Douglas-fir, 7 - Krummholz.
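For readability of later predictions, these integer codes can be mapped back to species names. A minimal sketch (the `cover_names` dict and `decode_cover_types` helper are illustrative only, not part of the original pipeline):

```python
# Illustrative helper: map Cover_Type integer codes (1-7) to species names
cover_names = {
    1: "Spruce/Fir",
    2: "Lodgepole Pine",
    3: "Ponderosa Pine",
    4: "Cottonwood/Willow",
    5: "Aspen",
    6: "Douglas-fir",
    7: "Krummholz",
}

def decode_cover_types(codes):
    """Translate a sequence of Cover_Type codes into species names."""
    return [cover_names[c] for c in codes]

print(decode_cover_types([5, 5, 2]))  # -> ['Aspen', 'Aspen', 'Lodgepole Pine']
```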
In this section, I perform some analysis and basic visualization of the tree cover type dataset.
#import basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
import io
import warnings
warnings.filterwarnings('ignore')
# convert CSV to a pandas DataFrame
treetype = pd.read_csv('train.csv')
There are 15120 records in the training set, and the test set is available on Kaggle. There is enough data for us to train and validate.
#@title
treetype.shape
(15120, 56)
Here is a look at the first 5 rows of data.
#@title
treetype.head()
| Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 3 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 4 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 5 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
5 rows × 56 columns
Some attribute names are quite long, so I renamed several columns.
#@title
#Rename some columns to make it simple
treetype=treetype.rename(columns={"Horizontal_Distance_To_Hydrology": "HDis_Hydro", "Vertical_Distance_To_Hydrology": "VDis_Hydro"})
treetype=treetype.rename(columns={"Horizontal_Distance_To_Roadways": "HDis_Rd", "Horizontal_Distance_To_Fire_Points": "HDis_Fire"})
First, we get basic information and statistics for each variable.
treetype.describe()
| Id | Elevation | Aspect | Slope | HDis_Hydro | VDis_Hydro | HDis_Rd | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 15120.00000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | ... | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 |
| mean | 7560.50000 | 2749.322553 | 156.676653 | 16.501587 | 227.195701 | 51.076521 | 1714.023214 | 212.704299 | 218.965608 | 135.091997 | ... | 0.045635 | 0.040741 | 0.001455 | 0.006746 | 0.000661 | 0.002249 | 0.048148 | 0.043452 | 0.030357 | 4.000000 |
| std | 4364.91237 | 417.678187 | 110.085801 | 8.453927 | 210.075296 | 61.239406 | 1325.066358 | 30.561287 | 22.801966 | 45.895189 | ... | 0.208699 | 0.197696 | 0.038118 | 0.081859 | 0.025710 | 0.047368 | 0.214086 | 0.203880 | 0.171574 | 2.000066 |
| min | 1.00000 | 1863.000000 | 0.000000 | 0.000000 | 0.000000 | -146.000000 | 0.000000 | 0.000000 | 99.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 3780.75000 | 2376.000000 | 65.000000 | 10.000000 | 67.000000 | 5.000000 | 764.000000 | 196.000000 | 207.000000 | 106.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 7560.50000 | 2752.000000 | 126.000000 | 15.000000 | 180.000000 | 32.000000 | 1316.000000 | 220.000000 | 223.000000 | 138.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.000000 |
| 75% | 11340.25000 | 3104.000000 | 261.000000 | 22.000000 | 330.000000 | 79.000000 | 2270.000000 | 235.000000 | 235.000000 | 167.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 |
| max | 15120.00000 | 3849.000000 | 360.000000 | 52.000000 | 1343.000000 | 554.000000 | 6890.000000 | 254.000000 | 254.000000 | 248.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 7.000000 |
8 rows × 56 columns
From the pair plot below, we can see that elevation separates forest cover types best, especially when combined with aspect and slope. There are a few outliers in the graph; we will look into them to see if they are measurement errors.
#@title
sns.set_theme(style="ticks")
plt.style.use('seaborn')
treetypee_sub=treetype.iloc[:,np.r_[1:11,-1]].copy()
sns.pairplot(treetypee_sub, hue="Cover_Type",palette="Paired")
<seaborn.axisgrid.PairGrid at 0x160b0358a48>
There is no missing data in this dataset.
treetype.loc[:, treetype.isnull().any()].columns
Index([], dtype='object')
Principal Component Analysis\ Principal component analysis is not well suited to binary data, and the results from PCA are not ideal in my opinion: there is no clear pattern separating the principal components of the different tree species.
from sklearn.decomposition import PCA
from sklearn import preprocessing
# exclude predicted variable
treetype_PCA=treetype.iloc[:,np.r_[1:55]].copy()
# Scale the data (note: preprocessing.normalize rescales each sample/row to unit norm,
# which differs from standardizing columns with StandardScaler)
treetype_normal = preprocessing.normalize(treetype_PCA)
# create new target variable
treetype_target=treetype.iloc[:,-1].copy()
treetype_target=treetype_target.astype('category')
pca = PCA(3)
projected = pca.fit_transform(treetype_normal)
plt.scatter(projected[:, 0], projected[:, 1],
c=treetype_target, edgecolor='none', alpha=0.7,
cmap=plt.cm.get_cmap('Paired_r', 7))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();
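The weak separation can also be quantified: the explained variance ratio of the leading components stays low. A self-contained sketch of that computation with a pure-NumPy PCA on synthetic stand-in data (the analysis above uses `sklearn.decomposition.PCA`, which exposes the same quantity as `pca.explained_variance_ratio_`):

```python
import numpy as np

def explained_variance_ratio(X, n_components):
    """Fraction of total variance captured by the top principal components."""
    Xc = X - X.mean(axis=0)                   # center each column
    _, s, _ = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2                              # variance along each component (up to a constant)
    return var[:n_components] / var.sum()

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 samples, 10 nearly uncorrelated features
X = rng.normal(size=(200, 10))
ratios = explained_variance_ratio(X, 3)
print(ratios, ratios.sum())                   # each ratio stays well below 1 for unstructured data
```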
PCA performed poorly on this dataset. Let's take a look at the first three principal components and plot them in a 3D graph.
import prince
import pprint
from mpl_toolkits.mplot3d import Axes3D
import plotly.offline as pyo
pyo.init_notebook_mode()
import plotly.express as px
fig = px.scatter_3d(projected,x=0,y=1,z=2,color=treetype_target,opacity=0.7,color_discrete_sequence=px.colors.qualitative.D3)
fig.update_traces(marker=dict(size=2))
fig.show()
FAMD
PCA did not quite do the job. Factor analysis of mixed data (FAMD) is an alternative to PCA that handles mixed continuous and categorical variables. FAMD clearly gives better results than PCA here: it provides a better pattern for dissecting the different forest covers. I created a similar 3-D plot from the FAMD results, and it looks good. However, the explained variability is still low, and the classes overlap.
# Create a new dataframe with categorical wilderness and soil variables recovered from the dummy columns
wilderness=treetype_PCA.iloc[:,np.r_[10:14]].idxmax(axis=1,)
soil=treetype_PCA.iloc[:,np.r_[14:54]].idxmax(axis=1)
# Concat them back together
treetype_FAMD=pd.concat([treetype_PCA.iloc[:,np.r_[0:10]],wilderness,soil],axis=1)
import prince
import pprint
from mpl_toolkits.mplot3d import Axes3D
import plotly.offline as pyo
pyo.init_notebook_mode()
# Instantiate FAMD object
famd = prince.FAMD(
n_components=3,
n_iter=20,
copy=True,
check_input=True,
engine='auto',
random_state=42)
# Fit FAMD object to data
famd = famd.fit(treetype_FAMD)
projected=np.array(famd.row_coordinates(treetype_FAMD))
print(famd.explained_inertia_)
# Plot 3D scatter
import plotly.express as px
fig = px.scatter_3d(projected,x=0,y=1,z=2,color=treetype_target,opacity=0.7,color_discrete_sequence=px.colors.qualitative.D3)
fig.update_traces(marker=dict(size=2))
fig.show()
[0.06930829 0.04863929 0.03894013]
Our first step is to split the data into a training/validation set and a test set.
# train test split
from sklearn.model_selection import train_test_split
X=treetype.iloc[:,np.r_[1:55]].copy()
y=treetype.iloc[:,-1].copy()
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
# recode target values since several packages require class labels to start at 0 (0, 1, 2, ...)
y_train=y_train-1
y_test=y_test-1
Our task is multi-class classification. I included logistic regression, linear discriminant analysis, K-nearest neighbors, a decision tree, Gaussian naive Bayes, and a support vector machine.
# import all models
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# create model list and add models
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto',max_iter=500))) # cap SVM iterations to bound training time
# evaluate each model in turn
results = []
names = []
for name, model in models:
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy',n_jobs=-1,)
results.append(cv_results)
names.append(name)
print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
LR: 0.666254 (0.006207) LDA: 0.645089 (0.009705) KNN: 0.798445 (0.011194) CART: 0.776619 (0.005253) NB: 0.590692 (0.010171) SVM: 0.662366 (0.011995)
plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()
K-nearest neighbors has the best performance in the algorithm comparison, so our next step is to try to improve its performance by tuning the hyperparameters. I will also add XGBoost and a multilayer artificial neural network.
KNN
K-nearest neighbors is a simple classification technique. A few hyperparameters need to be tuned, including the number of neighbors, the leaf size, the power parameter of the Minkowski metric, and the weighting scheme.
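To make the power parameter `p` concrete: it selects the order of the Minkowski distance, with p=1 giving Manhattan distance and p=2 Euclidean distance. A minimal illustrative sketch (the `minkowski` helper here is for explanation only; scikit-learn computes this internally):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two feature vectors."""
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1))  # Manhattan distance: 7.0
print(minkowski(a, b, 2))  # Euclidean distance: 5.0
```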
from sklearn.model_selection import GridSearchCV
from sklearn.utils import parallel_backend
from sklearn.neighbors import KNeighborsClassifier
#using grid search to get the best estimator
clf1 = KNeighborsClassifier()
#create the parameter grid to pass to grid search
# note: a bare tuple like (1, 40, 1) is treated as three candidate values (1, 40, 1),
# not a range, so use range(1, 40) to sweep every value
param_dist = {
    'n_neighbors': range(1, 40),
    'leaf_size': range(1, 40),
    'p': [1, 2],
    'weights': ['uniform', 'distance'],
    'metric': ['minkowski', 'chebyshev'],
}
grid = GridSearchCV(clf1,param_dist,cv = 3,scoring = 'accuracy',n_jobs=1)
with parallel_backend('threading'):
grid.fit(X_train,y_train)
best_estimator = grid.best_estimator_
print(best_estimator)
KNeighborsClassifier(leaf_size=1, n_neighbors=1, p=1)
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
KNN=grid.fit(X_train, y_train)
y_pred_KNN1 =KNN.predict(X_test)
print('BEST K-NEAREST NEIGHBORS MODEL')
print('Accuracy Score - KNN:', metrics.accuracy_score(y_test, y_pred_KNN1))
BEST K-NEAREST NEIGHBORS MODEL Accuracy Score - KNN: 0.8531746031746031
from sklearn.metrics import classification_report
# generate a classification report
display_labels=[ "spruce/fir","lodgepole pine", "ponderosa pine","cottonwood/willow","aspen","Douglas-fir", "krummholz"]
print(classification_report(y_test, y_pred_KNN1,target_names=display_labels))
precision recall f1-score support
spruce/fir 0.77 0.71 0.74 421
lodgepole pine 0.74 0.65 0.69 438
ponderosa pine 0.89 0.81 0.85 428
cottonwood/willow 0.90 0.97 0.94 449
aspen 0.86 0.97 0.91 416
Douglas-fir 0.86 0.89 0.87 432
krummholz 0.92 0.97 0.94 440
accuracy 0.85 3024
macro avg 0.85 0.85 0.85 3024
weighted avg 0.85 0.85 0.85 3024
The overall accuracy is good at 0.85. The accuracy on lodgepole pine and spruce/fir could be improved: they have low recall scores because many of them are classified as each other or as aspen.
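One way to see such confusions directly is a row-normalized confusion matrix, where each row shows how a true class is distributed across predicted classes and the diagonal holds the per-class recall. A toy sketch with made-up labels (two classes standing in for spruce/fir and lodgepole pine), assuming only `sklearn.metrics.confusion_matrix`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: class 0 ("spruce/fir") is often predicted as class 1 ("lodgepole pine")
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
cm_norm = cm / cm.sum(axis=1, keepdims=True)  # each row now sums to 1
print(cm_norm)  # diagonal entries are the per-class recall
```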
metrics.plot_confusion_matrix(grid,X_test, y_test, display_labels=["spruce/fir","lodgepole pine", "ponderosa pine","cottonwood/willow","aspen","Douglas-fir", "krummholz"])
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x160d04f8c88>
I also plotted a multiclass ROC curve using the yellowbrick package; the results look strong for all forest cover types.
XGBoost
XGBoost is a tree-based boosting method that performs well on many high-dimensional classification tasks.
import xgboost as xgb
xg_reg = xgb.XGBClassifier(colsample_bytree = 0.5, learning_rate = 0.1,
max_depth = 10, alpha = 5, n_estimators = 100,eval_metric='merror',use_label_encoder=False)
# labels must be in [0, num_class)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
np.mean(preds == y_test)
0.8465608465608465
#import packages
import optuna
from optuna import Trial, visualization
from optuna.samplers import TPESampler
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
# define the objective for optuna to tune
def objective(trial: Trial,X_train,y_train):
train_X,val_X,train_y,val_y = train_test_split(X_train, y_train, test_size = 0.30,random_state = 101)
param = {
"n_estimators" : trial.suggest_int("n_estimators", 0, 1000),
'max_depth':trial.suggest_int('max_depth', 2, 25),
'reg_alpha':trial.suggest_int('reg_alpha', 0, 5),
'reg_lambda':trial.suggest_int('reg_lambda', 0, 5),
'min_child_weight':trial.suggest_int('min_child_weight', 0, 5),
'gamma':trial.suggest_int('gamma', 0, 5),
'learning_rate':trial.suggest_loguniform('learning_rate',0.005,0.5),
'colsample_bytree':trial.suggest_discrete_uniform('colsample_bytree',0.1,1,0.01),
'nthread' : -1
}
model = XGBClassifier(**param,eval_metric='merror',use_label_encoder=False)
model.fit(train_X,train_y)
return cross_val_score(model,val_X,val_y).mean()
# create optuna study and optimize it
study = optuna.create_study(direction='maximize',sampler=TPESampler())
study.optimize(lambda trial : objective(trial,X_train,y_train),n_trials= 50)
[I 2021-05-06 13:53:55,124] A new study created in memory with name: no-name-e61bf4ca-14c3-4eb8-98f6-7831669c152c [I 2021-05-06 13:54:52,124] Trial 0 finished with value: 0.752551724137931 and parameters: {'n_estimators': 381, 'max_depth': 21, 'reg_alpha': 2, 'reg_lambda': 3, 'min_child_weight': 3, 'gamma': 3, 'learning_rate': 0.012937648948112826, 'colsample_bytree': 0.9}. Best is trial 0 with value: 0.752551724137931. [I 2021-05-06 13:55:27,104] Trial 1 finished with value: 0.7787305025173363 and parameters: {'n_estimators': 749, 'max_depth': 9, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 4, 'gamma': 0, 'learning_rate': 0.17935234852216447, 'colsample_bytree': 0.42000000000000004}. Best is trial 1 with value: 0.7787305025173363. [I 2021-05-06 13:55:31,415] Trial 2 finished with value: 0.6889028213166144 and parameters: {'n_estimators': 85, 'max_depth': 8, 'reg_alpha': 5, 'reg_lambda': 1, 'min_child_weight': 4, 'gamma': 2, 'learning_rate': 0.007838379884265789, 'colsample_bytree': 0.16}. Best is trial 1 with value: 0.7787305025173363. [I 2021-05-06 13:56:57,893] Trial 3 finished with value: 0.7459390139640923 and parameters: {'n_estimators': 686, 'max_depth': 15, 'reg_alpha': 1, 'reg_lambda': 2, 'min_child_weight': 1, 'gamma': 3, 'learning_rate': 0.005492415447375306, 'colsample_bytree': 0.24000000000000002}. Best is trial 1 with value: 0.7787305025173363. [I 2021-05-06 13:58:21,857] Trial 4 finished with value: 0.7610954687945284 and parameters: {'n_estimators': 826, 'max_depth': 15, 'reg_alpha': 3, 'reg_lambda': 2, 'min_child_weight': 4, 'gamma': 2, 'learning_rate': 0.03155547049574213, 'colsample_bytree': 0.51}. Best is trial 1 with value: 0.7787305025173363. [I 2021-05-06 13:59:28,768] Trial 5 finished with value: 0.7696373135746177 and parameters: {'n_estimators': 686, 'max_depth': 9, 'reg_alpha': 5, 'reg_lambda': 4, 'min_child_weight': 2, 'gamma': 0, 'learning_rate': 0.01134981365055232, 'colsample_bytree': 0.77}. 
Best is trial 1 with value: 0.7787305025173363. [I 2021-05-06 14:00:17,027] Trial 6 finished with value: 0.7566854754440961 and parameters: {'n_estimators': 544, 'max_depth': 22, 'reg_alpha': 0, 'reg_lambda': 3, 'min_child_weight': 4, 'gamma': 3, 'learning_rate': 0.03836641331591119, 'colsample_bytree': 0.15000000000000002}. Best is trial 1 with value: 0.7787305025173363. [I 2021-05-06 14:01:45,983] Trial 7 finished with value: 0.782036667616605 and parameters: {'n_estimators': 860, 'max_depth': 13, 'reg_alpha': 5, 'reg_lambda': 3, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.020221541825891372, 'colsample_bytree': 0.93}. Best is trial 7 with value: 0.782036667616605. [I 2021-05-06 14:03:00,203] Trial 8 finished with value: 0.781211361261518 and parameters: {'n_estimators': 686, 'max_depth': 21, 'reg_alpha': 3, 'reg_lambda': 3, 'min_child_weight': 4, 'gamma': 0, 'learning_rate': 0.016775681202707633, 'colsample_bytree': 0.84}. Best is trial 7 with value: 0.782036667616605. [I 2021-05-06 14:03:29,645] Trial 9 finished with value: 0.7831370760900541 and parameters: {'n_estimators': 769, 'max_depth': 14, 'reg_alpha': 3, 'reg_lambda': 1, 'min_child_weight': 3, 'gamma': 0, 'learning_rate': 0.09086579742447577, 'colsample_bytree': 0.38}. Best is trial 9 with value: 0.7831370760900541. [I 2021-05-06 14:04:01,924] Trial 10 finished with value: 0.7299583927044742 and parameters: {'n_estimators': 349, 'max_depth': 25, 'reg_alpha': 4, 'reg_lambda': 0, 'min_child_weight': 2, 'gamma': 5, 'learning_rate': 0.1342566109273973, 'colsample_bytree': 0.36}. Best is trial 9 with value: 0.7831370760900541. [I 2021-05-06 14:04:27,431] Trial 11 finished with value: 0.7274756340837846 and parameters: {'n_estimators': 974, 'max_depth': 2, 'reg_alpha': 4, 'reg_lambda': 5, 'min_child_weight': 0, 'gamma': 1, 'learning_rate': 0.4994863898568257, 'colsample_bytree': 0.6799999999999999}. Best is trial 9 with value: 0.7831370760900541. 
[I 2021-05-06 14:06:44,363] Trial 12 finished with value: 0.7762515436496628 and parameters: {'n_estimators': 970, 'max_depth': 13, 'reg_alpha': 4, 'reg_lambda': 1, 'min_child_weight': 0, 'gamma': 1, 'learning_rate': 0.09653256522496734, 'colsample_bytree': 0.58}. Best is trial 9 with value: 0.7831370760900541. [I 2021-05-06 14:08:56,289] Trial 13 finished with value: 0.7809366391184573 and parameters: {'n_estimators': 869, 'max_depth': 12, 'reg_alpha': 2, 'reg_lambda': 1, 'min_child_weight': 1, 'gamma': 1, 'learning_rate': 0.06168354908314655, 'colsample_bytree': 1.0}. Best is trial 9 with value: 0.7831370760900541. [I 2021-05-06 14:09:33,425] Trial 14 finished with value: 0.7120459770114942 and parameters: {'n_estimators': 502, 'max_depth': 18, 'reg_alpha': 5, 'reg_lambda': 4, 'min_child_weight': 3, 'gamma': 5, 'learning_rate': 0.2840124911123528, 'colsample_bytree': 0.30000000000000004}. Best is trial 9 with value: 0.7831370760900541. [I 2021-05-06 14:10:10,525] Trial 15 finished with value: 0.7627476014059086 and parameters: {'n_estimators': 904, 'max_depth': 4, 'reg_alpha': 3, 'reg_lambda': 2, 'min_child_weight': 5, 'gamma': 0, 'learning_rate': 0.021007328604034457, 'colsample_bytree': 0.53}. Best is trial 9 with value: 0.7831370760900541. [I 2021-05-06 14:12:01,459] Trial 16 finished with value: 0.7845167664101833 and parameters: {'n_estimators': 585, 'max_depth': 17, 'reg_alpha': 1, 'reg_lambda': 4, 'min_child_weight': 1, 'gamma': 1, 'learning_rate': 0.056116536437048274, 'colsample_bytree': 1.0}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:12:26,837] Trial 17 finished with value: 0.7801075330103544 and parameters: {'n_estimators': 179, 'max_depth': 17, 'reg_alpha': 1, 'reg_lambda': 5, 'min_child_weight': 1, 'gamma': 1, 'learning_rate': 0.06881491880166857, 'colsample_bytree': 0.43000000000000005}. Best is trial 16 with value: 0.7845167664101833. 
[I 2021-05-06 14:13:50,010] Trial 18 finished with value: 0.7842416642918211 and parameters: {'n_estimators': 579, 'max_depth': 19, 'reg_alpha': 1, 'reg_lambda': 4, 'min_child_weight': 2, 'gamma': 2, 'learning_rate': 0.22162692867534786, 'colsample_bytree': 0.64}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:15:29,062] Trial 19 finished with value: 0.7572360596561223 and parameters: {'n_estimators': 586, 'max_depth': 25, 'reg_alpha': 1, 'reg_lambda': 4, 'min_child_weight': 2, 'gamma': 4, 'learning_rate': 0.3892194521662738, 'colsample_bytree': 0.69}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:16:58,968] Trial 20 finished with value: 0.7779029163104398 and parameters: {'n_estimators': 409, 'max_depth': 18, 'reg_alpha': 0, 'reg_lambda': 5, 'min_child_weight': 1, 'gamma': 2, 'learning_rate': 0.19887550739595639, 'colsample_bytree': 0.98}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:18:07,783] Trial 21 finished with value: 0.7798343307685001 and parameters: {'n_estimators': 595, 'max_depth': 19, 'reg_alpha': 2, 'reg_lambda': 4, 'min_child_weight': 3, 'gamma': 1, 'learning_rate': 0.09630404168057936, 'colsample_bytree': 0.63}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:19:06,841] Trial 22 finished with value: 0.7787293625914316 and parameters: {'n_estimators': 446, 'max_depth': 16, 'reg_alpha': 1, 'reg_lambda': 4, 'min_child_weight': 2, 'gamma': 2, 'learning_rate': 0.04549512809497024, 'colsample_bytree': 0.45000000000000007}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:19:29,204] Trial 23 finished with value: 0.7657794243374181 and parameters: {'n_estimators': 252, 'max_depth': 20, 'reg_alpha': 2, 'reg_lambda': 5, 'min_child_weight': 3, 'gamma': 1, 'learning_rate': 0.09939510245027552, 'colsample_bytree': 0.27}. Best is trial 16 with value: 0.7845167664101833. 
[I 2021-05-06 14:21:13,609] Trial 24 finished with value: 0.7754220575662581 and parameters: {'n_estimators': 754, 'max_depth': 11, 'reg_alpha': 1, 'reg_lambda': 4, 'min_child_weight': 2, 'gamma': 2, 'learning_rate': 0.2879540626773244, 'colsample_bytree': 0.78}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:22:23,019] Trial 25 finished with value: 0.7820370475919065 and parameters: {'n_estimators': 630, 'max_depth': 23, 'reg_alpha': 2, 'reg_lambda': 1, 'min_child_weight': 1, 'gamma': 1, 'learning_rate': 0.161657803672463, 'colsample_bytree': 0.35}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:23:03,214] Trial 26 finished with value: 0.7707403818751781 and parameters: {'n_estimators': 776, 'max_depth': 15, 'reg_alpha': 3, 'reg_lambda': 3, 'min_child_weight': 3, 'gamma': 0, 'learning_rate': 0.0651952359279555, 'colsample_bytree': 0.2}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:24:12,402] Trial 27 finished with value: 0.7677077989930654 and parameters: {'n_estimators': 485, 'max_depth': 17, 'reg_alpha': 0, 'reg_lambda': 2, 'min_child_weight': 2, 'gamma': 4, 'learning_rate': 0.032391998890756536, 'colsample_bytree': 0.35}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:24:43,313] Trial 28 finished with value: 0.7608188467749597 and parameters: {'n_estimators': 650, 'max_depth': 11, 'reg_alpha': 1, 'reg_lambda': 0, 'min_child_weight': 5, 'gamma': 2, 'learning_rate': 0.24671267704634786, 'colsample_bytree': 0.1}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:25:36,061] Trial 29 finished with value: 0.7558601690890093 and parameters: {'n_estimators': 334, 'max_depth': 23, 'reg_alpha': 2, 'reg_lambda': 3, 'min_child_weight': 3, 'gamma': 3, 'learning_rate': 0.13908754295468986, 'colsample_bytree': 0.88}. Best is trial 16 with value: 0.7845167664101833. 
[I 2021-05-06 14:26:45,814] Trial 30 finished with value: 0.7751480953738007 and parameters: {'n_estimators': 560, 'max_depth': 20, 'reg_alpha': 3, 'reg_lambda': 5, 'min_child_weight': 1, 'gamma': 1, 'learning_rate': 0.11179245918220719, 'colsample_bytree': 0.49}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:27:53,081] Trial 31 finished with value: 0.7748726132801368 and parameters: {'n_estimators': 630, 'max_depth': 23, 'reg_alpha': 2, 'reg_lambda': 1, 'min_child_weight': 1, 'gamma': 1, 'learning_rate': 0.17611491158849385, 'colsample_bytree': 0.33}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:28:11,258] Trial 32 finished with value: 0.7845156264842785 and parameters: {'n_estimators': 744, 'max_depth': 23, 'reg_alpha': 1, 'reg_lambda': 1, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.4091846827584354, 'colsample_bytree': 0.38}. Best is trial 16 with value: 0.7845167664101833. [I 2021-05-06 14:28:33,305] Trial 33 finished with value: 0.7916804407713499 and parameters: {'n_estimators': 795, 'max_depth': 19, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.4957003744134343, 'colsample_bytree': 0.45999999999999996}. Best is trial 33 with value: 0.7916804407713499. [I 2021-05-06 14:28:54,459] Trial 34 finished with value: 0.8043605965612235 and parameters: {'n_estimators': 708, 'max_depth': 21, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.45550120993772997, 'colsample_bytree': 0.56}. Best is trial 34 with value: 0.8043605965612235. [I 2021-05-06 14:29:17,423] Trial 35 finished with value: 0.8038107722998005 and parameters: {'n_estimators': 818, 'max_depth': 21, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.44596307033053684, 'colsample_bytree': 0.5700000000000001}. Best is trial 34 with value: 0.8043605965612235. 
[I 2021-05-06 14:29:41,576] Trial 36 finished with value: 0.7988479148855324 and parameters: {'n_estimators': 923, 'max_depth': 21, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.4841614935851065, 'colsample_bytree': 0.48}. Best is trial 34 with value: 0.8043605965612235. [I 2021-05-06 14:30:05,066] Trial 37 finished with value: 0.805184383015104 and parameters: {'n_estimators': 925, 'max_depth': 21, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.4789171985839456, 'colsample_bytree': 0.5700000000000001}. Best is trial 37 with value: 0.805184383015104. [I 2021-05-06 14:30:32,225] Trial 38 finished with value: 0.8035333903296286 and parameters: {'n_estimators': 927, 'max_depth': 21, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.35612196973290194, 'colsample_bytree': 0.55}. Best is trial 37 with value: 0.805184383015104. [I 2021-05-06 14:30:59,991] Trial 39 finished with value: 0.8065640733352334 and parameters: {'n_estimators': 996, 'max_depth': 25, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.36458866385213856, 'colsample_bytree': 0.59}. Best is trial 39 with value: 0.8065640733352334. [I 2021-05-06 14:31:31,816] Trial 40 finished with value: 0.8024276622019568 and parameters: {'n_estimators': 992, 'max_depth': 25, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.32633292558947574, 'colsample_bytree': 0.59}. Best is trial 39 with value: 0.8065640733352334. [I 2021-05-06 14:31:59,773] Trial 41 finished with value: 0.8024318419302745 and parameters: {'n_estimators': 927, 'max_depth': 22, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.3692832950225812, 'colsample_bytree': 0.55}. Best is trial 39 with value: 0.8065640733352334. 
[I 2021-05-06 14:32:25,579] Trial 42 finished with value: 0.7999498432601881 and parameters: {'n_estimators': 856, 'max_depth': 24, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.4453601362763901, 'colsample_bytree': 0.74}. Best is trial 39 with value: 0.8065640733352334. [I 2021-05-06 14:32:52,399] Trial 43 finished with value: 0.8060115892466989 and parameters: {'n_estimators': 814, 'max_depth': 21, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.3376285799609253, 'colsample_bytree': 0.6}. Best is trial 39 with value: 0.8065640733352334. [I 2021-05-06 14:33:24,109] Trial 44 finished with value: 0.8018782179158354 and parameters: {'n_estimators': 823, 'max_depth': 22, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.2614578157701639, 'colsample_bytree': 0.62}. Best is trial 39 with value: 0.8065640733352334. [I 2021-05-06 14:33:46,051] Trial 45 finished with value: 0.7960904341217822 and parameters: {'n_estimators': 699, 'max_depth': 24, 'reg_alpha': 0, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.4861224982713907, 'colsample_bytree': 0.66}. Best is trial 39 with value: 0.8065640733352334. [I 2021-05-06 14:36:45,108] Trial 46 finished with value: 0.8071138975966562 and parameters: {'n_estimators': 877, 'max_depth': 20, 'reg_alpha': 0, 'reg_lambda': 1, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.3554574599640166, 'colsample_bytree': 0.73}. Best is trial 46 with value: 0.8071138975966562. [I 2021-05-06 14:40:13,676] Trial 47 finished with value: 0.8076644818086823 and parameters: {'n_estimators': 1000, 'max_depth': 20, 'reg_alpha': 0, 'reg_lambda': 1, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.3184523600045855, 'colsample_bytree': 0.73}. Best is trial 47 with value: 0.8076644818086823. 
[I 2021-05-06 14:43:33,819] Trial 48 finished with value: 0.8060108292960958 and parameters: {'n_estimators': 994, 'max_depth': 20, 'reg_alpha': 0, 'reg_lambda': 1, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.31287758011509764, 'colsample_bytree': 0.74}. Best is trial 47 with value: 0.8076644818086823. [I 2021-05-06 14:46:41,747] Trial 49 finished with value: 0.7900290681105728 and parameters: {'n_estimators': 988, 'max_depth': 18, 'reg_alpha': 1, 'reg_lambda': 1, 'min_child_weight': 1, 'gamma': 1, 'learning_rate': 0.005790077471872208, 'colsample_bytree': 0.72}. Best is trial 47 with value: 0.8076644818086823.
print('Best trial: score {},\nparams {}'.format(study.best_trial.value,study.best_trial.params))
Best trial: score 0.8076644818086823,
params {'n_estimators': 1000, 'max_depth': 20, 'reg_alpha': 0, 'reg_lambda': 1, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.3184523600045855, 'colsample_bytree': 0.73}
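The objective that produced these trials is defined earlier in the notebook; as a hedged sketch, a search space consistent with the logged parameters could look like the following. The `DummyTrial` class is purely illustrative, standing in for an Optuna `Trial` so the sketch runs on its own, and the exact ranges are assumptions inferred from the log, not the notebook's actual bounds.

```python
def suggest_xgb_params(trial):
    # Search space inferred from the parameter names in the trial log above.
    return {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 25),
        'reg_alpha': trial.suggest_int('reg_alpha', 0, 5),
        'reg_lambda': trial.suggest_int('reg_lambda', 0, 5),
        'min_child_weight': trial.suggest_int('min_child_weight', 0, 5),
        'gamma': trial.suggest_int('gamma', 0, 5),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.5),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 1.0),
    }

# Stand-in for optuna.Trial so the sketch runs without Optuna installed;
# a real study would call study.optimize(objective, n_trials=...) instead.
class DummyTrial:
    def suggest_int(self, name, low, high):
        return low
    def suggest_float(self, name, low, high):
        return low

params = suggest_xgb_params(DummyTrial())
print(sorted(params))
```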
Below is an optimization history plot of all trials, along with the best score at each point.
optuna.visualization.plot_optimization_history(study)
optuna.visualization.plot_slice(study)
The accuracy of the model on the test set is about 89%.
model = XGBClassifier(**study.best_trial.params,eval_metric='merror',use_label_encoder=False)
model.fit(X_train,y_train)
preds = model.predict(X_test)
np.mean(preds == y_test)
0.8908730158730159
XGBoost is clearly better than KNN at classifying this dataset: we see less classification error on lodgepole pine and spruce/fir, and they are no longer as likely to be misclassified as aspen.
metrics.plot_confusion_matrix(model,X_test, y_test, display_labels=["spruce/fir","lodgepole pine", "ponderosa pine","cottonwood/willow","aspen","Douglas-fir", "krummholz"])
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x160896eb608>
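To make explicit what the confusion matrix above encodes, here is a minimal hand-rolled sketch on toy labels (the real plot comes from scikit-learn; this version is only for illustration, with a reduced label set):

```python
def confusion_counts(y_true, y_pred, labels):
    # counts[i][j] = number of samples whose true label is labels[i]
    # and whose predicted label is labels[j]
    idx = {lab: k for k, lab in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[idx[t]][idx[p]] += 1
    return counts

labels = ['spruce/fir', 'lodgepole pine', 'aspen']
y_true = ['spruce/fir', 'spruce/fir', 'lodgepole pine', 'aspen']
y_pred = ['spruce/fir', 'lodgepole pine', 'lodgepole pine', 'aspen']
cm = confusion_counts(y_true, y_pred, labels)
print(cm)  # rows: true class, columns: predicted class
```

Off-diagonal entries are the misclassifications discussed above, e.g. the 1 in row spruce/fir, column lodgepole pine.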
from yellowbrick.classifier import ROCAUC
visualizer = ROCAUC(model, classes=["spruce/fir","lodgepole pine", "ponderosa pine","cottonwood/willow","aspen","Douglas-fir", "krummholz"])
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show() # Finalize and show the figure
<AxesSubplot:title={'center':'ROC Curves for XGBClassifier'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>
from sklearn.metrics import classification_report
display_labels=["spruce/fir","lodgepole pine", "ponderosa pine","cottonwood/willow","aspen","Douglas-fir", "krummholz"]
print(classification_report(y_test, preds,target_names=display_labels))
                   precision    recall  f1-score   support

       spruce/fir       0.82      0.81      0.81       421
   lodgepole pine       0.84      0.73      0.78       438
   ponderosa pine       0.87      0.87      0.87       428
cottonwood/willow       0.95      0.98      0.96       449
            aspen       0.91      0.96      0.93       416
      Douglas-fir       0.89      0.89      0.89       432
        krummholz       0.95      0.98      0.97       440

         accuracy                           0.89      3024
        macro avg       0.89      0.89      0.89      3024
     weighted avg       0.89      0.89      0.89      3024
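As a sanity check on the report, the macro average is just the unweighted mean of the per-class scores; for f1, using the values printed above:

```python
# Per-class f1 scores copied from the classification report above.
f1_scores = [0.81, 0.78, 0.87, 0.96, 0.93, 0.89, 0.97]
macro_f1 = sum(f1_scores) / len(f1_scores)
print(round(macro_f1, 2))  # matches the reported macro avg f1 of 0.89
```

The weighted average instead weights each class by its support; here the classes have similar support, so the two averages coincide at 0.89.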
Neural Network
Artificial neural networks usually underperform on structured (tabular) data, but we will give them a try here.
import tensorflow as tf
from tensorflow import keras
import kerastuner as kt
#Check GPU usage
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
Num GPUs Available: 1
Here we use Keras Tuner to optimize the number of layers, the number of neurons in each layer, and the learning rate.
def model_builder(hp):
    model = keras.Sequential()
    # Tune the number of hidden layers and the number of units in each
    # layer (an optimal value between 16 and 512, in steps of 16)
    for i in range(hp.Int('num_layers', 1, 15)):
        model.add(keras.layers.Dense(units=hp.Int('units_' + str(i),
                                                  min_value=16,
                                                  max_value=512,
                                                  step=16),
                                     activation='relu'))
    model.add(keras.layers.Dense(7, activation='softmax'))
    # Tune the learning rate for the optimizer
    # Choose an optimal value from 0.01, 0.001, or 0.0001
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    # The output layer already applies softmax, so the loss receives
    # probabilities rather than logits
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])
    return model
tuner = kt.Hyperband(model_builder,
                     max_epochs=30,
                     factor=3,
                     objective='val_accuracy',
                     directory='forest_cover2',
                     project_name='kt')
tuner.search(X_train, y_train,
             epochs=50,
             validation_split=0.2)
Trial 90 Complete [00h 01m 06s] val_accuracy: 0.7363636493682861 Best val_accuracy So Far: 0.7433884143829346 Total elapsed time: 00h 17m 05s INFO:tensorflow:Oracle triggered exit
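For intuition on why the search ran 90 trials, Hyperband with `max_epochs=30` and `factor=3` allocates its budget in successive-halving brackets. The sketch below uses the bracket-size formulas from the Hyperband paper; Keras Tuner's exact bookkeeping may differ slightly, so treat the numbers as approximate.

```python
import math

# Hyperband schedule sketch for max_epochs=30, factor=3.
max_epochs, factor = 30, 3
s_max = int(math.log(max_epochs, factor))  # most aggressive halving bracket

brackets = []
for s in range(s_max, -1, -1):
    n = math.ceil((s_max + 1) * factor ** s / (s + 1))  # initial configs
    r = max_epochs * factor ** (-s)                     # starting epochs each
    brackets.append((s, n, round(r, 1)))
    print(f"bracket s={s}: {n} configs starting at ~{r:.1f} epochs")
```

Aggressive brackets try many configurations for very few epochs and discard most of them; the final bracket trains a handful of configurations for the full 30 epochs.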
# Retrieve the best model.
best_model = tuner.get_best_models(num_models=1)[0]
# Evaluate the best model.
loss, accuracy = best_model.evaluate(X_test, y_test)
95/95 [==============================] - 1s 3ms/step - loss: 0.5664 - accuracy: 0.7588
It is able to achieve about 76% accuracy. The performance is not quite as good as XGBoost, but it is on par with K-nearest neighbors. However, since K-nearest neighbors is a much simpler method computationally, it is a better choice than the ANN here.
Overall, I was able to achieve good prediction accuracy on the test dataset with both the K-nearest neighbors algorithm and XGBoost. XGBoost especially improved precision and recall on the spruce/fir and lodgepole pine classes, reaching a test accuracy of about 89%. The neural network, on the other hand, achieved an accuracy of about 76%. The major difficulty in classifying lodgepole pine is that this species has the widest range of environmental tolerance of any conifer in North America: it adapts to many different climates, remaining a minor species in warm, moist places and becoming dominant in cold, dry places.
To further improve the accuracy score, I think introducing weather data might help with the misclassification seen here.